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Omics experiments endowed with a time-course design may enable us to uncover the dynamic inter- 
play among genes of cellular processes. Multivariate techniques (like VAR(1) models describing the 
temporal and contemporaneous relations among variates) that may facilitate this goal are hampered 
by the high-dimensionality of the resulting data. This is resolved by the presented ridge regularized 
maximum likelihood estimation procedure for the VAR(1) model. Information on the absence of tem- 
poral and contemporaneous relations may be incorporated in this procedure. Its computational efficient 
implemention is discussed. The estimation procedure is accompanied with an LOOCV scheme to de- 
termine the associated penalty parameters. Downstream exploitation of the estimated VAR(1) model 
is outlined: an empirical Bayes procedure to identify the interesting temporal and contemporaneous 
relationships, impulse response analysis, mutual information analysis, and covariance decomposition 
into the (graphical) relations among variates. In a simulation study the presented ridge estimation 
procedure outperformed a sparse competitor in terms of Frobenius loss of the estimates, while their 
selection properties are on par. The proposed machinery is illustrated in the reconstruction of the p53 
signaling pathway during HPV-induced cellular transformation. The methodology is implemented in 
the ragt2ridges R-package available from CRAN. 


Keywords: Cervical cancer; Constrained estimation; Known network support; Maximum 
likelihood; Vector autoregressive process. 
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1 Introduction 


Molecular biology aims to unravel the workings of the cell. This amounts to which and how molecules 
like mRNAs work together. A widely accepted paradigm assumes a modular set-up (Alon, 2003). 
Modules, called pathways, comprise groups of genes that work together to (for example) process a 
signal (which may then be passed on to another pathway). For some pathways it is (at least partially) 
known how genes interact. Often this knowledge is derived from experiments on properly working 
model systems. Cancer, however, is due to a (possible multiple) failure(s) at the molecular level of 
the cell (Weinberg, 2006). Such failures may cause pathways to be dysregulated in cancer cells. The 
gene-gene interaction pattern may thus be altered. Knowledge of the regulatory patterns of pathways 
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in cancer cells is of paramount importance for the development of molecular medicines that specifically 
target the failure (or consequences thereof). 

For the elucidation of the gene—gene interactions within the cancer cell we consider oncogenomics 
studies with a time-course set-up. The time-course set-up implies that each sample included in the 
study is followed over time and, at multiple time points during this period, interrogated molecularly. 
Time-course studies are more powerful to investigate (dynamical aspects of) the regulatory system. 
Consider the following representative example of a cervical cancer genomics experiment with a time- 
course set-up (which is analyzed later in Section 6). The human papilloma virus (HPV), a carcinogenic 
entity, is introduced in normal cells, generating a HPV-immortalized cell line. In our cervical cancer 
experiment two cell lines are infected with HPV16 and two with HPV18 (Steenbergen et al., 1996). The 
resulting cell lines are shown to faithfully mimic cervical cancer development, morphologically, and 
(epi)genetically (Steenbergen et al., 2004; Wilting et al., 2006; Henken et al., 2007). After introduction 
of the virus the cells are maintained in culture and go through distinct phenotypic stages to finally 
become fully transformed, as illustrated by their ability to grow anchorage independent. During this 
transformation process, which can take many population doublings, cells can be profiled at different 
stages of transformation. In the cervical cancer experiment cells from each cell line are profiled by 
oligonucleotide microarrays at eight uniformly distributed time points. This will provide insight into 
the changes in expression over time underlying the cellular process of carcinogenesis. 

Statistically, the dynamics of the gene expression levels within a pathway may be described by a 
vector autoregressive model, abbreviated to VAR(1) model where “1” specifies the lag. The VAR(1) 
model specifies the distribution of the process but does not directly reveal the underlying network of 
(cross-)temporal and contemporaneous relations among its variates. This is captured by its associated 
time-series chain graph (Dahlhaus, 2000), which describes the conditional independencies among 
variates over time. The time-series chain graph suggests how molecular medicine could target the 
pathway. 

VAR(1) models are well-studied (confer monographs Hamilton, 1994 and Liitkepohl, 2005), in 
particular their full maximum likelihood (ML) estimation. Little attention has been given to full 
ML estimation from high-dimensional data as typically arise from time-course genomics studies. To 
our knowledge only Abegaz and Wit (2013) address this problem by augmenting the likelihood of the 
VAR(1) model with SCAD (smoothly clipped absolute deviation) penalties (Fan and Li, 2001). Here we 
contrast this by considering ridge penalization instead. This is motivated from penalized estimation of 
Gaussian graphical models (GGM) used in the reconstruction of the gene—gene interaction network 
from observational genomics studies. In a vis-a-vis data-driven comparison of the lasso and ridge 
estimators (of Friedman et al., 2008 and Van Wieringen and Peeters, 2016, respectively) of the GGM, 
the latter performed (in contrast to the former) favorably in terms of loss and comparably in terms 
of selection properties (Van Wieringen and Peeters, 2016). With the SCAD penalty being closely 
related to the lasso, the favorable behavior of ridge estimation of GGM may carry over to ridge 
estimation of VAR(1) models and may benefit the reconstruction of their associated time-series chain 
graph. 

The outline of the paper is as follows. First, the VAR(1) model and its properties along with 
its associated time-series chain graph are recapitulated. With this knowledge refreshed, the ridge 
penalized full ML estimator of the VAR(1) model is presented. The estimator is extended to allow the 
incorporation of prior knowledge on the support of both temporal and contemporaneous interactions. 
In both cases memory efficient evaluation of the estimator is outlined. Cross-validation (which requires 
minor changes to the estimator) is described to guide the choice of the penalty parameters. Then, 
several strategies (e.g. selection of temporal and contemporaneous relationships, mutual information, 
and path analysis) for down-stream exploitation of the estimated model are discussed. This 1s followed 
by a comparison of the SCAD and ridge penalized ML estimation procedures of the VAR(1) model. 
Finally, the presented methodology is applied to the aforementioned HPV-induced transformation 
study. 
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2 The VAR(1) model 


Consider an experiment in which 1 samples (cell lines) are followed over time. Of each sample p variates 
(e.g. mRNA levels) are measured at T time points. The time points coincide among the samples. Let 
Y,,,;, be a random variable representing the measurement of variate j in sample i at time point ¢. The 
vector comprising the measurements of all variates of sample i at time point f is denoted Y,, , ;. 

The data from the time-course experiment are modeled by a VAR(1) (first-order vector autoregres- 
sive) process: 
|Y Veni = Yanr) Yarn = Ub AY 1a + eee (1) 


Vos *et—Lirrs 


where v the p x | intercept vector, A a p x p mshi Ge coefficient matrix, and e, ,; a p x 1 vector 
with the errors. Throughout it is assumed that v= 0,,,., (data should thus be centered per variate 
and within-sample), e, ,,; ~ NV (0,1, Z,) and Cov(e ) =0ift, At, ori, Fi,. Finally, the 
VAR(1) process is assumed to be stationary (having 7 first and second moment). Under the above 
assumptions the mean of Y, , ; equals 0,..,; (due to v = 0,,,.,). Furthermore, from model (1) it is clear 
that the variance of Y, , ; satisfies: Z, = AZ,A' + Z,, which may be solved for Z,, explicitly in terms 
of A and &,. Finally, the autocovariance (the covariance between the variates at one time point 
and some other) of the VAR(1) process is: T(t) = Cov(Y Y = x, [A‘]" for t > 0. Similarly, 
I(t) =A", fort < 0. 

The maximum likelihood (ML) estimators of the parameters of the VAR(1) model can readily be 


*,0,0 * tt? €..1,.i, 


*, 67? oped 


obtained. The likelihood of model (1) is, using Y, , ;| ¥,.,-)_; ~ N(AY,, ,_) ;, 2): 
nT 1 1 - 
-1 
L(Y; A, x a= — HII (27 )P/2| 5, 172 © exp 5 (Yai AY, 11,1) x, (Yai ~~ AV) | . 


Maximization of this likelihood yields: 


n T-1 n T-1 =1 
A= ary EE Yoon ‘|| ay jy De De Mant vi ’ 


i=1-7=1 i=! .f=1 


1 nT 
x, a nT De » are > AY, 1 ][¥a.1,i a AV 263]: 


i=l t=2 


Thus, A is estimated by the product of the sample estimates of the autocovariance T(—1) and the 
variance I'(0), while Z, is estimated by the sample residual covariance matrix. Both ML estimators (of 
A and &,) are biased but consistent (Nicholls and Pope, 1988; Liitkepohl, 2005). Finally, we remark 
that the OLS (ordinary least squares) and ML estimators of A coincide (in the unpenalized case). 

The VAR(1) model is well-known and only those properties relevant to the remainder of the paper 
have been recapitulated above. The reader is referred to the standard textbooks of Hamilton (1994) 
and Lutkepohl (2005) for more on the VAR(1) model. 


2.1 Conditional independency 


The conditional independencies among the variates of Y, ,; as implied by the VAR(1) model are 
represented by a conditional independence graph (Whittaker, 1990). Such a graph G 1s a pair (VY, €), 
where Y is a finite set of nodes (also called vertices) of graph components (here variates 7 = 1,..., p). 
The € is a subset of V x VY of ordered pairs of nodes, called edges of the graph. The nodes are the 


© 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.biometrical-journal.com 


Biometrical Journal 59 (2017) 1 175 


entities of interest and the edges represent the relationships among the entities. If both ordered pairs 
(v,,¥>) and (¥,, v,) belong to €, the graph G has an undirected edge between v, and v,. The edge 
is directed if only one of the two is in €. An undirected graph comprises undirected edges only. An 
undirected graph G is a conditional independence graph for the VAR(1) process if (v;,v,) gE => 
Y, i LY, wil Yup, o,)i Where, for example Y,, ; represents the subprocess of variate v over time. 


al SRL 

Helpful in the understanding of the relation between the parameters and conditional indepen- 
dencies of a VAR(1) process is a time-series chain graph, which represents the local (time-wise) 
conditional independencies of the process. Following Dahlhaus (2000) and Dahlhaus and Eichler 
(2003), a time series chain graph Gr; = (Vrs, Eps) is a mixed graph, a graph containing both di- 
rected and ee edges. The vertex set V;, comprises of the variates at all time points, that is 
Vrs = {.--5Vi1, Vis Vi41, -- J, while the edge set €;., constitutes of all temporal and contemporaneous 
edges. Temporal edges are directed and (in the VAR(1) model) connect nodes from V, to that of V,,, 
for any ¢t € Z. Presence or absence of these edges correspond to nonzero or zero (respectively) elements 
of A, where the latter implies the conditional independence relation Ye gel Y, garel Saad s Tha 
Contemporaneous edges are undirected and connect the nodes within Y,. Presence or absence of con- 
ae aad edges match with nonzeros or zeros in the error precision matrix Z;!, where the zeros 
imply Y; cL OY, al Vane UY), sti? Yiaib Yous 

Globil idl iadependenes among subprocesses of the VAR(1) process hold if the partial 
correlations among the subprocesses vanish (Theorem 3.1 of Dahlhaus and Eichler, 2003). Proposition 
2.7 of Dahlhaus and Eichler (2003) specifies a parametric criterion for the vanishing partial correlations 
in multivariate time series. For the VAR(1) process these results imply that Yet EY al Yoana 
if 


Vi, T 


a) (Ze) ji, =0, a 

b) (ATZS'), = P(A); jee jy, = 0, and 

c) ee ER (A), ay = 0, and 

d) (A'z 2 AD) = = i pa (A); (Ze! a7 (A) jr 3, =0. 
The parametric criteria can be translated to the time series chain graph: subprocesses Y, i and 
Y, jyei ate conditionally independent if there is a) no contemporaneous edge between j; and Jo, 6) 


no temporal edge from j, to any 7 (i.e. no (A) jj, # 0) with an contemporaneous edge to j,, c) no 
temporal edge from j, to any j with a contemporaneous edge to j,, and d) no temporal edges from 
j, and j, to any j and j’ that are connected contemporaneously. Hence, conditional independencies 
between subprocesses can be read off the time-series chain graph. 


3 Ridge estimation of the VAR(1) model 


In this section, the parameters of the VAR(1) model (1) with vy = 0,,.; (requiring variate-wise and 
within-sample zero-centering of the data) are estimated from high-dimensional data by means of ridge 
penalized likelihood maximization. The resulting estimators are shown to be efficiently computable 
from high-dimensional data. Moreover, the estimators may be endowed with a “data augmentation” 
and Bayesian interpretation. Finally, the ridge estimation procedure is modified to allow prior knowl- 
edge on the support of temporal (Section 3.3) and contemporaneous (Section 3.4) edges of time-series 
chain graph. 


3.1 Ridge maximum likelihood 


The log-likelihood augmented with ridge penalties (rewritten as e.g. A, || All5 = A, ji. h _,[(A) ‘ak = 
i, tr [vec(A)' vec(A)]) for A and , is: 
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LP (Y; A, @,; 2,,, 2,4, Ag: Bo) = log[L(Y; A, 2,)]— 5 — 1), tr[(@, — 2))' (2, — Q,)] 
~ 5uT — 1)A,tr {[vec(A) — vec(Ay)]' [vec(A) — vec(Ay)]}, 


in which Q, = Z;!. The p x p matrices Ay and Q) are so-called target matrices. The ridge estimates 
of A and &, are shrunken (with increasing values of the penalty parameters) toward these matrices. 
The target matrices may represent user-specified prior knowledge on the parameters. In the envisioned 
usage A, and Q, are estimated from a pilot study or from publicly available data of a related dis- 
ease/organ/organism and plugged in as targets when estimating the VAR(1) model from the study at 
hand. In case of the error precision matrix, Van Wieringen and Peeters (2016) introduce a target for 
Q,, to ensure that the estimate is shrunken toward a well-defined, positive definite precision matrix. In 
the remainder A, and Q, are included in the estimators, but set equal to zero in simulation (to ensure 
a fair comparison) and application (no pilot data at hand). 

Maximization of the ridge penalized log-likelihood proceeds as usual. Take the derivative of the pe- 
nalized log-likelihood with respect to vec(A) and equate this derivative to zero to obtain the estimating 
equation of A. Solve the estimating equation for A to arrive at the ridge ML estimator: 


n oT = 
vec [A(A)] = [er = Dada +. >> Meier Yoir1 B 2..)| 
f=1 t=2 
he E 
x [mT — Da,veo(Ay) + > D> (Yer @LIY¥. | (2) 


i=1 t=2 


where ® denotes the Kronecker product (Van Loan, 2000). This estimator involves p* x p* dimensional 
matrices, making the computation of Estimator (2) impractical even for modestly sized pathways. But 
numerical evaluation of the ridge ML estimator can be done efficiently when the estimator is reformu- 
lated algebraically. Hereto recall the following properties of the Kronecker product (confer Harville, 
2008). Let A, B, and C be p x p dimensional symmetric matrices, VD. V| the eigen-decomposition 
of matrix X = A, B, or C, and 4 € R. Then: 

i) A®B+B@C= (A+B) @C, 

ii) (A' @ B)vec(C) = vec(BCA), 
iii) AL, 2p +A@B= (V,@V, AI, 2 +D, @ D,)(V; @®V;}). 


Application of the first two properties yields: 
vec{A (A.,)] = [Agl,2.2 + F(0) ® B,J {A,,vec(Ay) + vec[@,F(—1)]}. 
Using the second and third property we get: 


vec{A(2,)] = vec[V,,(vectdiagl (AI », + D,, ® D,,)~'}} o vec(V,ZV,, ))V), ]; 


a” px p? 


in which vec(Z) = A, vec(Ay) + vec[2,.F(—1)] and o denotes the Hadamard or element-wise product. 
The limiting behavior of the ridge ML estimator of A with respect to the penalty parameter A, 


is as expected. Given @,, as A,, | 0 the ridge ML estimator A(A,) converges to the unpenalized ML 
estimator of A (assuming it is well-defined). To evaluate the other limit, 1, > oo, note that the eigenval- 


ues of [AT 2.92 + T(0) @ @,)-! equal 1/[A, + Dy) p.7 (D,,);,;,] for j\, 2 = 1,.--, p. Consequently, 


lim, _,., A(A,) = Ao, again given @,. 
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Note that the ridge LS (Least Squares) estimator is obtained from Eq. (2) by setting 2, =I,,., and 
simplifies to: 


A(Qa,) = PyAo + F(-DIA,L,,., + FOP". (3) 


Clearly, the ridge LS estimator does not involve &,. This fact will be exploited to initiate the ridge ML 
estimation of A and @,. 

The VAR(1) model may be viewed as a system of Seemingly Unrelated Regressions (SUR) (Zellner, 
1962), for which Zellner (1962) showed that the LS and ML estimators of the regression coefficients 
(A) coincide. Clearly, the ridge LS and ML estimators of A need not be equal. This can be understood 
as follows. It is well-known that in the presence of parameter constraints ML and LS estimators of 
regression coefficient need not coincide. In particular, Lutkepohl (2005) shows this for the VAR(1) with 
linear equality constraints on A. As the problem of maximization of the ridge penalized log-likelihood 
can be reformulated as a constrained estimation problem, it is readily understood that the ridge ML 
and LS estimators need not coincide. 

To derive the ridge ML estimator of &,, write 


no T 


1 
S, = n(T — 1) 2 » hare ~ AX cial Mads ~ AY, cal 


i=l t=2 


The loss function for 2, is then proportional to: 
5n(T — 1) log|Q,| — 5n(T — 1)tr(S,2,) — 5n(T — 1), tr[(@, — Qo)" (Q, — Q)]. 


The optimum for this loss function is (confer Van Wieringen and Peeters 2016): 


1/2 


2.(2.,,) = 1 I 7 8, ~ 4.2)" a 3(S, ~ 1,Qp)} (4) 


@” pxp 
In (Van Wieringen and Peeters, 2016) it is shown that this solution is positive definite for 4, € (0, oo), 
lim, jo 2.(,.) = Sz! (if it is well-defined) and lim, _,. 2, (2,,) = Qp. Given A, these results directly 
apply here. . 

Joint ridge ML estimation of A and Z, may now be performed by iteratively estimating them 
individually keeping the other temporarily fixed. Pseudo-code of this iterative procedure can found 
in the Supplementary Material (henceforth, SM) I. The convergence of this iterative procedure is 
warranted by, for example Lemma | of Oberhofer and Kmenta (1974) or Proposition A.3 of Lauritzen 
(1996). These results apply directly here as the augmentation of the log-likelihood by the two ridge 
penalties does not affect the continuity of the loss function. Finally, the concavity of ridge penalized 
log-likelihood implies that the convergence is toward a global maximum. 

Finally, one may consider replacing I'(0) in the ridge ML estimator by: oe aan Ye ae 
This estimator of F(0) does not exclude t = 1 in the inner summation above. As such it is more 
efficient. As T is usually small in omics studies with a time-course set-up, this efficiency gain may be 
considerable. 


3.2 Interpretations 


Let us try to understand the effect of ridge penalty on the estimation of A within the VAR(1) model 
(assuming Ay = 0 to simplify the argument, which is analogous but less elegant for nonzero target 
choices). As the ridge estimator is the product of a lag one autocovariance estimate and the inverse of 
a biased covariance estimate, consider the related object: A(A) = P(—1)[F'(0) + al, lee = AZ, (Z, + 


AL. wre Define a new stochastic process by convoluting the VAR(1) process with white noise: 
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Zi1i = Yuri + 4,1; with Y,,; as in the original process (1) and 6, ,; ~N(0,,;, 41 
ance of this process is: Z,, + AL, xp and Z, ,; has the same autocovariance as Y, , ;. A(A) is then the 
product of the autocovariance with lag -1 and inverse of the variance of Z, , ;. Inclusion of the ridge 
penalty could thus be thought of as adding noise to the process. This “adding noise” interpretation 
of the ridge ML estimator of A is confirmed when noting that this estimator can be obtained by data 
augmentation. Hereto simply augment the data array Y with extra multivariate time series to have di- 
mensions p x T x (n +n). The extra time series comprises white noise distributed as V(0,,,.1, AgI,,.,)- 
The expectation of the variance and lag one covariance estimators constructed from the 2n individuals 
is then 5[T(0) + A,1 x pj) and 3T(- 1), respectively. The ridge ML estimator of A can thus be thought 
of as resulting from “white noise” data augmentation. 

The ridge estimator of Z, could be endowed with a similar “data augmentation” interpretation. 


Hereto consider its estimator (the inverse of Eq. (4)) with Q) = 0 The estimator is then a weighted 


pxp) Lhe vari- 


PX ?P* 
sum of the sample covariance matrix and the (matrix-version of the) power mean of S, and \/.,I,,,.,,- 
This estimator could thus be arrived at when the set of estimated residual vectors are augmented with 
vectors sampled from a multivariate normal with covariance equal to the aforementioned power mean. 
In contrast to the data augmentation of the ridge estimator of A, these additional residual vectors are 
not completely uninformative but represent a compromise between white noise and the observed data 
(as forged by the power mean). A nonzero target precision matrix only affects this compromise, not 
the interpretation of data augmentation. 

The ridge ML estimators of A and Q, ae ae a Bayesian interpretation. A normal prior on the 
elements of A, that is vec(A) | 2, ~ NV (0 exis Aq 1p), produces a posterior mean that coincides with 


its ridge estimator: E(A | Y, @,) = A(A,) ene Lutkepohl, 2005). This assumes a zero target, but the 
derivation of Lutkepohl (2005) are analogous for a nonzero target (that would correspond to a nonzero 
mean of the normal prior). Similarly, set a Wishart prior on the precision: 2, ~ W([4n? Alen + 
wS2|"'/2,n). This yields a normal-Wishart posterior with mean E(,|S,) = @,(A,), showing that 
the ridge precision estimator may also be motivated from a Bayesian perspective. The appearance of 
S, in the prior of 2, has an empirical Bayes flavor with the penalty parameter 1, then weighing how 
much the scale parameter of the Wishart prior borrows from the non- and from the fully informative 
case. This derivation uses a zero target precision matrix, but is similar for a nonzero one only making 
calculations more cumbersome. 


3.3. Known temporal conditional independencies 


Prior knowledge of the temporal relations among genes may be available. For instance, it may be 
known whether the expression levels of one gene at time ¢ affects (or not) those of another gene at 
the next instance. Such information is reflected in the (non-)zero entries of A. Alternatively, posterior 
to the estimation of A one may wish to decide on its (non-)zero elements (a procedure to this end is 
detailed in Section 4.1). Afterwards it may be worthwhile to obtain a final (and possibly less biased) 
estimate of A incorporating the inferred temporal conditional independencies as zero elements in A. 
Knowledge on the support A can be formulated as linear equality constraints on A. We discuss 
how this is incorporated in the ridge ML estimation of the VAR(1) model. Let the linear equality 
constraints on A be given by C vec(A) = d, where C and d are g x p? and q x 1 dimensional matrices 
and g the number of constraints. The ridge ML estimator of A, the constrained version of A, solves: 


max LP™(Y; A, Q 


Ay, @ 5 
{A,:CVeC(A,)=d)} 0» 2). ©) 


Qa dos 


Lagrange’s method (employing the dual of problem 5) is used to solve this maximization problem. 
This yields: 
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vec[A.(4,)] = vec[A(a,)1+ [aul exp + £0) @ @.] 


Px p 
CCA A »..2 + FO) @® @J'C TY '{Cvec[A(,)] — d}. (6) 
in which A(A,) is the unconstrained ridge ML estimate. 
To impose a null structure on A denote by Vy C {1,2,..., p’} the indices corresponding to zero 
elements of A (written as a vector vec(A)) and define its complement by Vj = {1, 2,..., DP} \ Y. Now 


set d= 0,,.; with ¢ = |Vo|, the number of zero’s in A. Furthermore, take C = (I,,, P)V,.v uve the 
identity matrix without rows corresponding to the nonzero parameters (i.e. with indices in V)). This 
results in a g x p* dimensional C. 

When A is sparse (q is much closer to p* than to zero) the evaluation of the constraint estimator (6) 
involves inversions of large matrices. These can be avoided. Hereto write © = 1,12... + TO) @Q, 
and assume w.l.o.g. that the first g rows and columns of A,I2,.,2 + TO) @ @, correspond to the 
elements of VY) and the remaining to those of its complement. Then: 


es ( E! eet) 
S -l RT -1 -l RT > 
-E-' FT ((@),.),) 1+ FEF 


where E = (®),, 1, — (O)y yx [yx vel! (@)\» y, and F = [Oy yl! (®)y y,. The second sum- 
mand on the right-hand side of (6) becomes: 
_ _ _ A Cvec[A(a y] 
© 'C'{(C@"'!C"} | Cvec[A(A,)] = aire © 
{ } vec[, ( al [@O)y yI! (@),.y, Cvec(A(,)] 


From this it is clear that the constrained ridge estimator of A can be evaluated by setting the elements of 
vec{A , (A.,,)] corresponding to Vp equal to zero, which does not require any intensive matrix inversions. 
The evaluation of the non-zero elements of vec{A, (A,,)] involves the inversion of a (p* — q) x (p? — q) 
matrix. When A is sparse and q (thus) large, this is feasible. 

We finish with two remarks to further enhance computational efficiency. The submatrix @),. y. can 
be obtained without expanding the Kronecker products involved in ©. To this end note that @,,. 
is a principal submatrix of the sum of two Kronecker products. The principal submatrix of each of 
these Kronecker products can be rewritten as a Hadamard product of the proper elements of A 
and I,,., and of T(0) and Q,. 

The second remark concerns the g x 1-dimensional vector (),x y, Cvec[AQ,)], which may be 
calculated element-wise. Each row of @ is itself the sum of two Kronecker products of rows of A 


alse Dp 


I 
a’ pxp 
andI. and of F(0) and Q,. The rows of © may be generated one at the time, restricted to the 


PpXp 
elements of Vy, and multiplied by Cvec{A(A,)]- This avoids storage of Ory: 

The computation times of the evaluation of estimator (6) using the “sparse” and “dense” procedures 
outlined above were compared using the microbenchmark-package (Mersmann, 2014), while varying 
p and the sparsity percentage. The results (SM I) confirm that the “dense” approach performs best 
for low sparsity percentage, while the roles are reversed for high sparsities. 


3.4 Known contemporaneous conditional independencies 


Next to knowledge on the support of A, one may also have knowledge on that of &,. For instance, 
from knock-out experiments certain contemporaneous conditional independencies (i.e. the absence of 
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a direct/instantaneous interaction between particular gene pairs) may have been inferred. Instanta- 
neous should be operationalized as a much smaller time scale than that represented by the temporal 
interactions implied by A. Here it is shown how such knowledge can be incorporated in the ridge 
ML estimation of 2, under the assumption that its support forms a decomposable graph. In case the 
support of &, does not form a decomposable graph, it may be triangulated (addition of a minimal 
number of edges such that the augmented graph is decomposable). The topological constraint of a 
decomposable graph facilitates a development of a fast algorithm to find the ridge ML estimate of Q, 
subject to known contemporaneous conditional independencies. 

First the concept of a decomposable graph is defined, prior to which several prerequisites from 
graph theory are introduced. A complete graph contains all possible edges. A graph G, = (A, €,) 
is a subgraph of G if AC V and €, C Ax A. A clique is a subgraph of (V,€) that is complete, 
that is every node in the clique is connected to every other node in the clique. In an undirected 
graph a path of length m from a@ to f is a (m+ 1)-tuple of distinct nodes a = v,,...,¥,,,, = 6 such 
that for 1 < j < m both (y,,,, v;) and (y,, v;,;) are in €. A subset S C V separates in graph G the 
subsets A, 5 C Y if all paths from any aw € A and any # € G run through S. The subset S is then 
called a separator. If the subgraph G, of separator S is complete, the disjoint triple (A, B, S) forms a 
decomposition of G. An undirected graph G is then decomposable if either it is complete or it contains 
a decomposition (A, 8, S) such that the subgraphs G,,); and Gp, are both decomposable. SM HI 
contains an illustration of a decomposable graph and its decomposition. Confer the monograph of 
Lauritzen (1996) for more on decomposable graphs. 

When the (conditional independence) graph G induced by a precision matrix 2, is decomposable 
into a set of cliques C and of separators S, the corresponding probability distribution P(G) factorizes 
over these sets, assuming positivity of P(G) (Proposition 3.16, Lauritzen, 1996). Consequently, the log- 
likelihood becomes a linear combination of log-likelihoods associated with cliques and separators. The 
ML estimator of 2, can then be obtained from the ML precision estimates of cliques and separators 
(Proposition 5.9, Lauritzen, 1996). An explicit expression for the ML estimator of &, with associated 
decomposable G thus exists. 

We now turn to the ridge ML estimation of &, associated with a decomposable conditional inde- 
pendence graph G. In limiting cases (1,, = 0, A,, = 00, and S = G) the ridge ML estimator of Q, can 
be expressed as a combination of the ridge ML estimators of the cliques and separators that decom- 
pose G, analogous to unpenalized case specified by Proposition 5.9 of Lauritzen (1996) but with the 
unpenalized marginal ML covariance estimates of cliques and separators replaced by their penalized 
counterparts (details in SM IV). For example, when G decomposes into cliques C,, C,, and S: 


s(C)) SC) 
[2. lo\s.c,\s [@ “le\ss Vic\sixic,\s 
3S BC) om) BG (Oo) BS) GG,” 
2,0.) = | [2.' Isc\s (2. “Iss +12, “Ieg—2. (2, “Isc\s , (7) 
sc) SG), 
De\sxic\s [&. “le\ss [2. “]e\s.c,\s 
a(C,) aC) i ee ; 
where 2, ', 2, ° , and &, ° (with the dependence on 4, dropped on the right-hand side for notational 


clarity) are the marginal ridge ML covariance estimators for cliques C,, C,, and separator S. Although 
only applicable to boundary cases, this estimator may form a reasonable approximation to the estimator 
in the general case. Would the graph G be sparse, the separators are most likely small and S = @ is 
a reasonable approximation. Alternatively, when the sample size is large compared to the number of 
parameters little penalization is required and 2,, = 0 becomes a good approximation. At the other 
end of the scale the data are high-dimensional and only heavy regularization of the precision matrix 
yields a well-conditioned estimator: 4, = oo provides a reasonable first-order approximation. Hence, 
estimator (7) may be used as an educated guess of the general estimator. 

In general, the ridge ML precision estimate is found by invoking a Newton-like procedure to find 
the optimum of the penalized log-likelihood with known support of the precision matrix. This exploits 
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the reformulation of this constrained optimization problem to an unconstrained one, as outlined by 
Dahl et al. (2005) for the unpenalized case. This equivalence generalizes straightforwardly to the ridge- 
penalized case. Hereto let &, have g nonredundant nonzero elements indexed by the set {( Fyes dete =" 


(ignoring the elements of the upper triangular part of 2,). Then define the p x g dimensional matrices: 


col 


vie By) and E 


where e; is a unit vector comprising only of zeros but the j-th element that is one. Each column of 
E,,,, thus contains a one in the same row as the row of corresponding nonzero element of &,. For E,,, 
the one is the same row as the column of the nonzero element. Furthermore, define the g dimensional 
vector x by: 


(G2). ie if Jr, cal Je, ’ 


aa : fork =1,...,¢. 
1 —_ ’ ’ 
oa) aie if Jr, => Je, 


(x); _ 


Effectively, x thus contains the nonzero elements of 2, below and on the diagonal. The precision 
matrix can now be rewritten as: 


Q(x) = E,,,, diag(x) E),, + E,,,, diag(x) Ey, 


TOW col row* 


Substitution of this reparametrized precision matrix in the penalized log-likelihood yields the uncon- 
strained optimization problem: 


mae log |S2, (x)| ~ tr[S,@, (x)] _ ,tr{[Q2, (x) _ Q)]'[2, (x) ~ QI}. 


This can be optimized by means of standard Newton-like procedures like gradient ascent. 

The Newton-like procedure is initiated by the educated guess above. The improvement in the log- 
likelihood by the Newton-like procedure over the initial estimate is often minor and convergence is 
usually achieved quickly. This of course depends on the resemblance of the case at hand and the three 
limiting cases. Next to an informative initial guess, further gain in computational cost may be achieved 
when the graph G is disconnected and the optimization may be done per graph component. 

Decomposability is not strictly necessary. When the support of 2, does not form a chordal graph, 
Dahl et al. (2005) suggest to estimate its nonzero elements by means of chordal embedding, that is 
extending the support to that of smallest encompassing chordal graph. 


3.5 Choice of the penalty parameters 


The penalty parameters are chosen such that they maximize the leave-one-out cross-validated 
(LOOCYV) log-likelihood. The LOOCV log-likelihood is obtained from the leave-one-out ridge es- 
timates of A and Z,, which are based on the leave-one-out sample. This sample is obtained by removal 
of a single element from the set of design points: D = {(f,/):t=1,...,7,i=1,...,n}. The proposed 
ridge estimators of A and Z, need to be modified in order to cope with the unbalancedness of this 
sample. We illustrate this modification for the ridge estimation of A. The modification replaces the 
full sample estimates of T(0) and T'(1) by their leave-one-out counterparts. For left-out design point 
(t,, i,) the latter are (respectively): 


A 1 A 1 
) T = ) T 
Ti, 0 = (0) Wega % eid and Ke, — (1) Vase e103 
j ID\¢ i )| (0) |D (t,,i y| (1) 
eve D\u,.i ) \Gyie Drip) 


£ 
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where |YV| denotes the cardinality of set W and 
Die) IED ED: CIDA Gi), 


Dri H{“LD: GIDED C-lLIDED CDFG.) F C-1, dD}. 


With the leave-one-out estimates A 
by: 


\Cini)) and Zev, j,) at hand, the LOOCYV log-likelihood is obtained 


A A Tal A 
ys log(|¥.\@,4,)) a heres - A\ ai) Yet,-1i] Ze \ ei LY at, ai i A\ i) Yet,-li,} 
(t,,i,)€D 


For notational clarity the dependency of these estimates on the penalty parameters 1, and A,, has been 
suppressed. The couple (A, 2,,) that maximizes the LOOCV log-likelihood is the optimal choice for 
the penalty parameters. 


a’ 


4 Post-estimation analysis 


With knowledge of the parameters of the VAR(1) model not all is evident. In particular, the conse- 
quences of the dynamics within the VAR(1) model are not explicit once A and Z, have been estimated. 
Comprehension of the temporal interactions within a pathway requires understanding of the implica- 
tion of fluctuations in the expression of gene j at time ¢ on transcript levels of some gene at a particular 
future time point. The subsections below discuss statistics measuring precisely that. 


4.1 Support determination 


With estimates of the parameters A and Q, available, their nonzero elements are of main interest. 
The proposed ridge estimates of A and @, are, however, nonsparse and their support needs to be 
determined a posteriori. Hereto we employ an empirical Bayes testing procedure proposed by Efron 
(2004) and extended and implemented by Strimmer (2008). The first step of this procedure summarizes 
the evidence against the null hypothesis, for example Hy : (A); j, = 9, ina test statistic. For elements 


of A this is: 
4= AQ, MEA; 37, 


For support determination of the off-diagonal part of &, the partial correlation is used as test statistic 
(confer Van Wieringen and Peeters, 2016). The procedure assumes these test statistics follow a mixture 
distribution: Z ~ my fy(Z) + (1 — m9) f,(Z), where components f)(Z) and f,(Z) are the densities of 
the test statistic under H, and its alternative, while zy is the mixing proportion. Then, assuming that 
most elements of AQ,) originate from Hp, the multiplicity of these p x p test statistics is exploited 
to estimate /)(Z). Hereto test statistics larger than the significance threshold are censored, while the 
remaining test statistics are assumed to belong to the null and are used for the estimation of f,(Z) 
through a truncated maximum likelihood approach (confer Strimmer, 2008). With an estimate of /)(Z) 
at hand, the posterior probability P(H) | Z = z) (which can be endowed with a local FDR (€FDR) 
interpretation Efron, 2004) is estimated. This probability, of a particular element of A (or 2,) being 
“uninteresting,” given the observed test statistic, is used to decide on the presence of an element of A 
(or @,). 
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A conventional cut-off choice on the posterior probability of being “uninteresting” would be to select 
only those elements with such a probability smaller than 0.05. In our experience this may sometimes 
produce a rather dense support of the parameters for large p and a more stringent cut-off may be more 
appropriate for initial screening purposes. Vice versa, for small p the conventional cut-off may identify 
few to none “interesting” elements and a more liberal choice may be preferable. 


4.2 Impulse response analysis 


Within the field of econometrics the change in one variable at time ¢ on variables at a time ¢ + t (for 
t > 0) is quantified by the impulse response function. This function describes the effect of an impulse 
(a unit change) in e, , ; at time ¢ on the response Y, ,,, ; at time ¢ +t. This function is exactly the 
derivative of Y, ,,, ; with respect to e, ,;: JY, ,,,; / 0, ,;. Iterative substitution of model (1) reveals 
that its impulse response function equals A* for t = 0,1, 2,... (confer Hamilton, 1994). Summary 
statistics derived from the elements of the powers of A yield insight into the most important variates 
of the process. Impulse response analysis of a pathway thus allows one to assess the change in gene 
expression levels at future time points due to a change in the expression levels of some gene at the current 
time. The former results from direct propagation of the gene expression change at the current time 
point through all possible, directed routes in the time series chain graph (avoiding contemporaneous 
edges). Impulse response analysis thus enables one to make a prediction on, for example the effect of 
gene knock-outs. 


4.3 Mutual information 


The impulse response analysis does not take into account the contemporaneous conditional indepen- 
dencies. To accommodate these we use the mutual information, a generalized “correlation” measure, 
between the variate 7 at some time and all variates at a future time point, given the past. More precise, 
the mutual information /(Y Yiail¥. ) is defined as (Cover and Thomas 2006): 


&t+T,2? t—Li 


1(Y Yiril Yar = AX! Mari — AV teil Vii Yarn) (8) 


*,t+7,i? *C+T 1 


where [(X, Y | Z) denotes the mutual information between random variables ¥ and Y given Z and 
H(X |Z) the entropy of X given Z. To evaluate the mutual information it thus suffices to evaluate 
the entropy of expression levels at time ¢ + t, conditional on the past (and the current expression level 
of gene 7). The entropy of a multivariate normal random variable is equal to the logarithm of the 
determinant of its covariance matrix. For the first summand on the right-hand side of (8), this is: 


AW areeelYaota) = Tog (Wary re/1¥ar10)) = log (| 7 Aa’z,(49"|). 
=0 


For the second summand the required covariance matrix is: 


t—1 
Var (Y, p47. | Yitw Y, s-1,) = A‘Var(Y,, ; ; | eit Y,¢-1,,)(A")" ar > A’, (A‘)', 
1=0 
with the j-th row and column of Var(Y,, ; ;| €;.1,;: Yx;—1,;) Comprising zeros only. Its submatrix without 


the j-th row and column equals Var(¥,\ ;,.;1 €j1i Yer—1,) = (Ze)\j\7 — Be CB), Cd). a8 
is readily obtained from Theorem 6.5 of Bickel and Doksum (2001). The mutual information thus 
measures the association between the current expression level of gene j and those of all genes in the 
pathway at a future time point (given the past). It does so through comparing the (logarithm of the) 
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generalized variance (i.e. the determinant of the variance) of the gene expression levels at some time 
point, un- and conditionally on knowledge of the expression levels of a gene prior to that time. Added 
value of this knowledge, implying a reduction in the generalized variance, yields a positive mutual 
information. Large mutual informations pinpoint to genes with a strong down-stream effect on the 
pathway, and may be interpreted as “drivers” of the pathway. 


4.4 Path decomposition of the covariance 


Path analysis is another way to study the downstream effect of a signal. It decomposes the covariance 
between two variates at different time points into the contribution of all paths within the time-series 
chain graph that connect these variates. For this decomposition write the covariance between ae i 


and Y; +r fort > Oas: 


Dp 
Yeo = (AD), = DE), 7, AI," 


j=l 


Cov(Y; 


ptt? Y, t+t,i 


In this (A*),, ; captures the contribution of all temporal paths that connect variates j’ and j, over 


t time units. Further, the term involving the error covariance matrix can be broken down into the 
contributions of the contemporaneous paths (invoking a result from Jones and West, 2005): 


41 Fetl(2.)\ pe] 7 
ei, Jy => ( Nae det (2, ) Ie. aah” 


PeP 


Jy Ja 


where P, , is the set of all contemporaneous paths from j, to j, and P = {(p, = J), Px). (P2 P3), 

, (P,,» P, = j>)}, a path of length r from j, to j,. Hence, the covariance between ¥, ,; and Y, rer 
can thus be decomposed as the instantaneous propagation of the innovation at time ¢ via the con- 
temporaneous conditional dependencies, followed by the response to the contemporaneous changes 
through to temporal relations among the variates. 

The observed covariance between the expression levels of gene 7 at the time ¢ and those of another 
gene at a later time point can thus be dissected into the contribution of the paths (comprising both 
directed/temporal, and undirected/contemporaneous edges) connecting them in the time-series chain 
graph. The paths with the largest contributions are most explanatory for the propagation of a signal 
occurring in gene j at the time ¢ to the other gene at later time. In particular, when comparing the 
sign of the contribution of the paths to that of the observed covariance, one may distinguish between 
“mediating” (concordant signs) and “moderating” (opposing signs) paths. 


5 Simulation: Ridge versus SCAD 


The ridge ML estimator of the VAR(1) model is compared to its SCAD counterpart (Abegaz and 
Wit, 2013), which has been implemented in the R-package SparseTSCGM, by means of simulation. 
me two methods are compared in terms of squared Frobenius loss of the estimates (e.g. |A — All = 

a ae [(A),. — (A), , })and the sensitivity and specificity of their edge selection of the time-series 
chain eraph, 

We first describe the set-up of the simulation study. Datasets are generated with various choices of 
the VAR(1) model parameters. For the regression coefficient matrix A a hub, clique, random (with 
two sparsity levels), and complete network structure is used. For the precision matrix @, a banded, 
complete supported matrices are used. Details of these choices can be found in SM VU. Furthermore, 


isty| 
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the number of variates p, time points J and the number of samples n are varied in accordance with 
a factorial design. The number of variates equals either p = 25 or p = 50 representing the size of 
non-trivial but well-defined pathways. Simultaneously, we set n € {5, 15} and T € {10, 20}. We stress 
that the particular sample size of (n = 5, T = 10) is the most challenging but also most relevant as its 
design is closest to that which is most prevalent in practice (confer Ernst et al., 2005 and SM VI). For 
each combination of model and study design parameter choices data are simulated in accordance with 
the VAR(1) model. For each such combination 50 time-course datasets are generated in this fashion. 

For each simulated dataset, the VAR(1) model is fitted using both the ridge and SCAD estimation 
procedures. The parameter estimates obtained from both methods are those using “optimal” penalty 
parameter values. Adhering to the ceteris paribus principle, they are chosen in identical fashion. As the 
implementation of the SCAD method prohibits LOOCYV (as it cannot handle missing time points), we 
resort to k-fold cross-validation with k = n. Effectively, this leaves out a single time-course, a Y,, , ;, at 
the time. Resulting SCAD estimates of A and &, are sparse, whereas their ridge counterparts are not. 
Only when comparing the methods with respect to the ability to reconstruct the time series chain graph, 
the ridge estimates undergo postestimation sparsification. Again obeying the ceteris paribus principle, 
the ridge procedure selects not by the aforementioned empirical Bayes approach but the same number 
of edges as found by the SCAD method from the largest elements elements of its estimate of A. This 
may favor the SCAD method, but ensures comparability of the sensitivity and specificity. 

We first discuss the Frobenius loss of the SCAD and nonsparsified ridge estimates, with a focus on 
that of A. The Frobenius loss of both estimates of A for datasets generated with various combinations 
of A and &, are compared by means of boxplots (all plots are deferred to SM VI). The simulation 
results with respect to loss can be summarized by the following points: 


e The empirically most prevalent experimental design (J = 10, 1 = 5 with p = 25) reveals that the 
Frobenius loss of the SCAD estimates of A exceeds that of its ridge counterparts for both sparsity 
levels and a full true A, irrespectively of the choice of true A. 

e Still concentrating on the most prevalent sample size (J = 10, n = 5), the loss difference grows 
larger in favor of the ridge estimates when the number of variates increases to p = 50, again 
irrespective of the true A but also the sparsity level. 

¢ When the number of time points increases to T = 20 the ridge estimates are still to be preferred over 
their SCAD counterparts (except for a 25 x 25 dimensional A with the sparsest hub structure). 

e Only for the lowest dimensional (p = 25) and sparse choices of A in combination with large 
sample sizes (1 = 15, T = 20) the loss difference turns to the advantage of the SCAD method. 
Clearly, then sparsity is detectable and the SCAD profits. But this advantage is lost to the ridge 
estimator when p increases or the sample size is reduced. Or, when A is less sparse, confer the case 
with a full A. 

e When turning to the Frobenius loss of the estimates of &,, again the ridge estimator is generally to 
be preferred. This is in line with Van Wieringen and Peeters (2016), who compare the Frobenius 
loss of the ridge precision estimator with its lasso counterpart (Friedman et al., 2008). 


We now turn to the sensitivity and specificity of the ridge procedure and that of the SCAD procedure 
(extensive results in SM VII). Recall, to make the sensitivity and specificity more comparable, the ridge 
procedure selects (from the largest elements elements of its estimate of A) the same number of edges 
as found by the SCAD method, thus favoring the latter. In the general picture that then emerges the 
methods are on a par. For the lower sample sized (and thereby more prevalent) settings a difference 
is virtually absent. More discriminative power is present when n = 15 and T = 20, but even then no 
clear winner appear (although then the hub case appears to favor the SCAD method). 

In all, the proposed ridge estimator of the VAR(1) model is a worthy competitor to the SCAD 
estimator of Abegaz and Wit (2013). With respect to the Frobenius loss the ridge estimator seems 
even preferable, while for more high-dimensional settings the edge selection properties of the ridge 
estimator not inferior to that of its SCAD counterpart. 
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6 Illustration 


The use of the proposed methodology is illustrated on p53 signaling pathway data from the aforemen- 
tioned HPV-induced transformation study. This pathway may serve as a potential positive control. 
During HPV-induced transformation the tumor suppressor p53, central to the pathway, is bound and 
targeted for degradation by a viral oncoprotein. When p53 is inactivated, altered signaling cascades are 
expected in this pathway. Ideally, this is reflected in the experimental data, from which then temporal 
and contemporaneous relations among the pathway’s genes could be uncovered. 

The p53 signaling data from the aforementioned experiment contains p = 64 genes, measured at 
T =8 time points in m = 4 cell lines. Their preprocessing comprises RMA background correction 
(Irizarry et al., 2003) and robust quantile between-array normalization (Bolstad et al., 2003), followed 
by a variance stabilizing transformation (Huber et al., 2002). Probes interrogating genes mapping to 
the p53 signaling pathway (defined by the KEGG repository, Kanehisa and Goto, 2000) are selected. 
When multiple probes map to the same gene their data are averaged. 

To learn the time-series chain graph of the p53 signaling pathway the VAR(1) model parameters are 
estimated as outlined in the Section 3. Optimal penalty parameters (A,, = 427.95465, A,, = 0.00259) 
are determined through maximization of the LOOCV log-likelihood. The optimum is confirmed by 
a contour plot of the LOOCV log-likelihood surface (confer SM VIII). The resulting estimates are 
visualized as heatmaps (SM VIII). 

To facilitate biological insight genes with the strongest temporal and contemporaneous relations 
are singled out through postestimation support determination of A and , (confer Section 4.1). For 
the selection of nonzero elements, corresponding to edges in the time-series chain graph, of A and Q, 
a cut-off on the posterior probability of being “uninteresting,” P(H) | Z = z), of 0.05 is applied. This 
high cut-off results in a sparse and interpretable graph. Figure 1 displays the resulting support of A 
as a heatmap. It reveals the presence of genes with either a horizontal or vertical red stripy pattern. 
Genes (e.g. CCNG1, CGKN2A, SERPINE1, SESN2, STEAP3) with the horizontal stripes can be 
thought of as the “regulatees” in the pathway, while the vertical ones (e.g. IGF1, IGFBP3, BBC3, 
CCND2, THBS1) are the “regulators” of the pathway. The absence of TP53 (central to the pathway 
to which it lends its name) among the “regulators” is a negative control. Under normal conditions it is 
central to the p53 signaling pathway, but in our cell line model it is indeed inactived (by the insertion 
of HPV and its consequences). The suggested distinction between “regulators” and “regulatees” is in 
line with the node statistics (like in- and out-degree, betweenness, centrality, confer Newman, 2010 for 
definitions) derived from the inferred graph (see Table 2, SM VIII). In the remainder we concentrate 
on the “regulators” driving the pathway. 

The aforementioned “regulators” are known either to be pivotal to the p53 signaling pathway or 
to be involved in HPV-induced carcinogenesis. For instance, IGF1 (insulin-like growth factor 1) is a 
growth factor that regulates proliferation of both normal and cancer cells in an autocrine, paracrine 
(Baxter, 2000), and potentially endocrine manner (D’Ercole and Calikoglu, 2001). HPV transformed 
keratinocytes were found to be addicted to IGF1 signaling for survival, underlining the importance 
of this pathway for cervical cancer development (Geiger and Levitzki, 2007). IGFBP3, central to this 
signaling pathway, is transcriptionally regulated by p53 and (among others) suppresses proliferation 
and induces apoptosis (the process of programmed cell death). In our model lacking functional p53 
due to presence of the HPV oncoprotein E6, IGFBP3 is downregulated (data not shown) and, thus, 
does no longer inhibit carcinogenesis. Downregulation of THBS1, a potent angiogenesis inhibitor, was 
described before after expression of the viral oncogenes E6 and E7 in keratinocytes of different origins 
(Toussaint-Smith et al., 2004). Lowered expression was also observed in cervical cancers Kodama et al. 
(2001). 

With knowledge of their support less biased estimates of A and @, are obtained by refitting them 
taking the support into account. Optimal penalty parameters are redetermined: A,, = 2.55082, 2,, = 
0.00583 (and confirmed by the contour plot the LOOCV log-likelihood surface, SM VIII). Reestimated 


© 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.biometrical-journal.com 


187 


Biometrical Journal 59 (2017) 1 


Histogram of the correlation fit vs. observation 


Inferred support of A 


Aguenbes4 


. . . 
. . J 
See ge @ s ase 
. = . 
. = 
= . 
s ". 
. . a 
. . 
i id i 
s . = 
. 
. ae ., . 


0.5 


0.0 


-0.5 


Correlation 


Figure 1 Top row, left: the inferred support of A; right: the histogram of gene-wise and cell line-wise 


correlation of fit and observation. Bottom panel: the time-series chain graph for the p53 signaling 


pathway. 
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Figure 2 Left panel, (1): incoherent FFL motif formed by the gene-triple IGFBP3, RPRM, and SFN. 
The solid (dashed) edge represent a positive (negative) effect. Left panel, (11): same as (i) but “unrolled.” 
Right panel: gene expression data of the IGFBP3, RPRM, and SFN genes over time in the four cell 
lines. 


parameters are visualized as a heatmap (SM VIII). Using these estimates the time-series chain graph 
is constructed (Fig. 1). This is in line with the above suggested interpretation of “regulators” and 
“regulatees.” 

Employing the reestimated parameter A, we study the fit (Y, aS AY, i 7-1) Of the “regulatees” (genes 
explained by other genes). The fit is studied for all genes in the pathway. The result is summarized in 
Fig. 1. It displays the histogram of the Spearman correlations between the fit and the observations, 
cell line-wise. The histogram in Fig. 1 is clearly skewed to the domain [0, 1]. This indicates that the fit 
is generally reasonable. For a more tangible assessment of the fit it is displayed for several individual 
genes in SM VIII. 

With final and less-biased estimates of the VAR(1) parameters at hand, we study the quantitative, 
dynamic implications of the model: what are the downstream effects of a change in expression levels 
of a gene? This can be done through impulse response analysis (confer Section 4.2). For each gene the 
column-wise average of the (absolute) impulse response on all other genes at the next time instance is 
calculated (Table 2, SM VIII). This is a measure of a gene’s driving force on the pathway’s expression lev- 
els. The low and high impulse responses of the {CCNG1I, CGKN2A, SERPINEI, SESN2, STEAP3} 
and {IGF1, IGFBP3, BBC3, CCND2, THBS1} genes corroborate with their interpretations of “reg- 
ulatees” and “regulators.” This is supported when evaluating the mutual information between each 
gene at time point ¢ and the whole pathway at the next time point (Table 2, SM VIII). 

The downstream effects of a signal may be further elucidated through the decomposition of the 
covariance between the expression levels of two genes in terms of the paths connecting them in the 
time-series chain graph (as described in Section 4.4). For illustration purposes consider the “regulator”- 
“regulatee” pair (IGF1, SESN2) at two contiguous time points. They are connected through two paths: 
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a direct path (Yigri.1 > Ysesn2,t41) and an indirect one through Yigri, > Yrrmot > Ysesnz,t41- The 
covariance of the (IGF1, SESN2) gene pair at contiguous time points may now be decomposed as: 
Cov(Yicri.t» Ysesnz.tz1) = (2A! Gri sesn2 = —0-05315 = —0.02203 — 0.03112, in which the two 
summands on the right-hand side correspond to the direct and indirect path in the time-series chain 
graph. Based on the paths’ contributions the indirect one dominates (surprisingly), but being of the 
same sign and (approximately) of the same size the direct path also contributes in the suppression of 
the expression levels of the SESN2 gene. 

Another way to grasp the inferred networks of the p53 signaling pathway is to study its motifs. 
Motifs are small recurring network patterns and form the building blocks of pathways (Alon, 2007). It 
takes to far afield to study all motifs present in the inferred network (although it is relatively sparse) in 
depth. For the purpose of illustration we single out the feedforward loop (FFL), which appears in many 
gene systems (Alon, 2007). FFL motifs come in the coherent and incoherent variety: the latter connects 
two genes via two paths that have opposite effects (positive/activating and negative/repressing), while 
in the former both paths have the same effect. In an incohorent FFL motif found in the reconstructed 
time-series chain graph of the p53 signaling pathway gene IGFBP3 represses RPRM and activates 
SEN, while (to a lesser degree) RPRM stimulates SFN gene (Fig. 2). IGFBP3 and RPRM thus affect 
SFN in opposite ways. In the extreme case, when the effect of the RPRM is equal to that of IGFBP3, 
this results (with a slight delay due to the time) in the repression of the SFN, reducing its expression 
levels to (virtually) zero (confer Alon (2007)). In any case, the effect of IGFBP3 on SFN is moderated 
by that of RPRM (as is corroborated by the data (Fig. 2)). 


7 Conclusion 


The reconstruction of the time-series chain graph associated with a VAR(1) model from high- 
dimensional omics data was studied. To this end, we presented a ridge penalized, full maximum 
likelihood estimation framework. The resulting estimator was simplified algebraically to ensure its 
computationally efficient evaluation from high-dimensional data. The estimator allows the incorpora- 
tion of prior knowledge in two ways. First, the penalty of both VAR(1) model parameters includes a 
user-chosen target, representing a suggestion of the actual parameter value toward which the parame- 
ter is shrunken. Secondly, knowledge on the absence of certain cross-temporal and contemporaneous 
edges of the time-series chain graph may be included. Various novel ways to make use of the thus 
estimated VAR(1) model were presented: (i) edge selection of the time-series chain graph from the 
estimated model, (ii) mutual information analysis across time points, and (iii) decomposition of the 
covariance between nodes across time points in terms of the paths of the time-series chain graph con- 
necting them. Then, the proposed ridge estimation procedure was compared in simulation to a SCAD 
penalized estimator of the time-series chain graph. The ridge estimation procedure outperformed its 
SCAD counterpart in terms of loss and showed to be competitive in terms of specificity and sensitivity, 
in particular for small sample sizes (as is the prevalent experimental design for practice). Finally, the 
use of the presented methodology is illustrated on a time-course experiment using an HPV-induced 
transformation cell line model. 

Further biological questions to be (partially) answered using the high-dimensional omics data of the 
aforementioned cell line experiment require extensions of the proposed ridge estimation framework to 
more intricate vector autoregressive models. For instance, the full experiment also encompasses data 
from other molecular levels like DNA copy number and microRNA expression. Inclusion of these 
levels, both affecting mR NA expression, may be done within a VAR X(1) model, a VAR(1) model with 
time-varying covariates (hence, the Y in VARX), which models the variation of mRNA expression 
over time by mRNA expression of the preceding time point and for example DNA copy number of 
the current time point. Another extension would address the fact that each cell line in the experiment 
had a different HPV inserted to initiate oncogenesis. This may lead to differences in the cell lines’ 
time-series chain graphs. To detect such differences in regulatory architecture a VAR(1) model may 
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be assumed per cell line and fitted with a fused ridge penalty (Bilgrau et al., 2015). Such a penalty 
fuses what is similar among the cell lines and preserves substantial differences among them. Future 
work will center around these and related topics to extract all relevant biological information from 
integrative time-course omics studies such as the cervical cancer one. 
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